FidaPLUS corpus of Slovenian

Authors

  • Špela Arhar
  • Vojko Gorjanc
  • Simon Krek
  • Marko Stabej
Abstract

The paper describes the FidaPLUS corpus, an upgrade of the Slovenian reference corpus. The corpus has been improved on several levels: size, up-to-dateness, quality of linguistic annotation (lemmatization, POS-tagging), availability and user-friendliness of the on-line concordancer. It has also been implemented in the Sketch Engine software, which produces one-page automatic, corpus-based summaries of a word’s grammatical and collocational behaviour. We describe the history of the project and present the characteristics of the corpus and its tools.

Similar resources

The JOS Morphosyntactically Tagged Corpus of Slovene

The JOS morphosyntactic resources for Slovene consist of the specifications, lexicon, and two corpora: jos100k, a 100,000 word balanced monolingual sampled corpus annotated with hand validated morphosyntactic descriptions (MSDs) and lemmas, and jos1M, the 1 million word partially hand validated corpus. The two corpora have been sampled from the 600M word Slovene reference corpus FidaPLUS. The J...

Slovene Word Sketches

Word sketches are one-page automatic, corpus-based summaries of a word's grammatical and collocational behaviour. They were first used in the production of the Macmillan English Dictionary (Rundell 2002). At that point, they only existed for English. Today, the Sketch Engine is available, a corpus tool which takes as input a corpus of any language and corresponding grammar patterns and which ge...

Contents and evaluation of the first Slovenian-German online dictionary

This paper presents the first Slovenian-German and German-Slovenian online dictionary and contains evaluation figures for its Slovenian part. Evaluations are based on coverage of a Slovenian newspaper corpus as well as on user queries.

BNSI Slovenian broadcast news database - speech and text corpus

This paper presents the BNSI Slovenian Broadcast News database project. The result of the project is a database with a speech and text corpus oriented toward large-vocabulary continuous speech recognition in the general domain. The speech corpus consists of 36 hours of transcribed evening and late night news. The raw database material was captured in the archive of national broadcaster RTV Slovenia t...

Slovene Terminology Web Portal and the TBX-Compatible Simplified DTD/schema

The paper describes the project whose main purpose is the creation of the Slovene terminology web portal, funded by the Slovene Research Agency and the Amebis software company. It focuses on the DTD/schema used for the unification of different terminology resources in different input formats into one database available on the web. Two projects involving unification DTD/schemas were taken as the...

Publication date: 2007